Show the code
import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html
# Include and execute your code here
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)
From this summary we can learn that November is the best time to travel. We can also learn that SLC has the least amount of delays by weather.
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.__
In this question, I did a simple filtering and relpacement in order to cleanly show “NaN” using df.replace.
missing_tokens = ["", " ", "NA", "N/A", "null", None]
df.replace(missing_tokens, np.nan, inplace=True)
for col in df.columns:
if "num_of_" in col or "minutes_delayed" in col:
df[col] = pd.to_numeric(df[col], errors="coerce")
df["month"] = df["month"].replace({"Febuary": "February"})
one_row_json = (
df.sample(1)
.replace({np.nan: "NaN"})
.to_json(orient="records", indent=2)
)
one_row_json'[\n {\n "airport_code":"IAD",\n "airport_name":"Washington, DC: Washington Dulles International",\n "month":"November",\n "year":2012.0,\n "num_of_flights_total":6207,\n "num_of_delays_carrier":245.0,\n "num_of_delays_late_aircraft":331,\n "num_of_delays_nas":189,\n "num_of_delays_security":1,\n "num_of_delays_weather":10,\n "num_of_delays_total":777,\n "minutes_delayed_carrier":13652.0,\n "minutes_delayed_late_aircraft":23025,\n "minutes_delayed_nas":-999.0,\n "minutes_delayed_security":16,\n "minutes_delayed_weather":329,\n "minutes_delayed_total":46556\n }\n]'
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
For this table, I decided to look into the avgerage delay hours by airport to be able to easily see which airport was doing the worst. I used lambda to do this.
summary = (
df.groupby("airport_code", dropna=True)
.agg(
total_flights = ("num_of_flights_total", "sum"),
avg_delay_mins = ("minutes_delayed_total", "mean")
)
.assign(
avg_delay_hrs = lambda d: d.avg_delay_mins / 60
)
.drop(columns="avg_delay_mins")
.sort_values("avg_delay_hrs", ascending=False)
)
summary.head()| total_flights | avg_delay_hrs | |
|---|---|---|
| airport_code | ||
| ORD | 3597588 | 7115.672854 |
| ATL | 4430047 | 6816.152273 |
| SFO | 1630945 | 3352.334975 |
| DEN | 2513974 | 3178.457197 |
| IAD | 851571 | 1298.418939 |
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
I did he same type of averaging idea for this question, but I averaged by month. It can clearly be seen that the best month to go is in November, due to the low amount of delays comparatively.
# Drop rows with missing months
month_df = df.dropna(subset=["month"])
month_tbl = (
month_df.groupby("month")
.agg(avg_delay_mins=("minutes_delayed_total", "mean"))
.reset_index()
)
month_order = [
"January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"
]
month_tbl["month"] = pd.Categorical(month_tbl["month"], categories=month_order, ordered=True)
month_tbl["avg_delay_hrs"] = month_tbl["avg_delay_mins"] / 60
month_tbl = month_tbl.sort_values("month")
from lets_plot import *
ggplot(month_tbl) + \
geom_bar(aes(x="month", y="avg_delay_hrs"), stat="identity", fill="#1f77b4") + \
ggtitle("Average Delay Time by Month (in Hours)") + \
xlab("Month") + ylab("Avg Delay (Hours)") + \
theme(axis_text_x=element_text(angle=45, hjust=1)) + \
ggsize(700, 350)According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
a. 100% of delayed flights in the Weather category are due to weather
a. 30% of all delayed flights in the Late-Arriving category are due to weather
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
This analysis is able to take a deeper look into possible different weather delays that are not just severe. From this, we can see that ATL has the most delays.
# Include and execute your code here
# Make sure values are numeric
for col in ["num_of_delays_weather", "num_of_delays_late_aircraft", "num_of_delays_nas"]:
df[col] = pd.to_numeric(df[col], errors="coerce")
# Fill missing late aircraft delays with column mean
df["num_of_delays_late_aircraft"].fillna(df["num_of_delays_late_aircraft"].mean(), inplace=True)# Include and execute your code here
def compute_weather(row):
weather = row["num_of_delays_weather"] or 0
late_aircraft = 0.30 * row["num_of_delays_late_aircraft"]
spring_summer = ["April", "May", "June", "July", "August"]
nas_pct = 0.40 if row["month"] in spring_summer else 0.65
nas = nas_pct * (row["num_of_delays_nas"] or 0)
return weather + late_aircraft + nas
df["weather_all"] = df.apply(compute_weather, axis=1)# Include and execute your code here
df[[
"airport_code", "month",
"num_of_delays_weather", "num_of_delays_late_aircraft", "num_of_delays_nas",
"weather_all"
]].head()| airport_code | month | num_of_delays_weather | num_of_delays_late_aircraft | num_of_delays_nas | weather_all | |
|---|---|---|---|---|---|---|
| 0 | ATL | January | 448 | -999 | 4598 | 3137.00 |
| 1 | DEN | January | 233 | 928 | 935 | 1119.15 |
| 2 | IAD | January | 61 | 1058 | 895 | 960.15 |
| 3 | ORD | January | 306 | 2255 | 5415 | 4502.25 |
| 4 | SAN | January | 56 | 680 | 638 | 674.70 |
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
SFO has the most delays from weather that are mild and severe.
# Include and execute your code here
df["weather_delay_prop"] = df["weather_all"] / df["num_of_flights_total"]
weather_by_airport = (
df.groupby("airport_code", dropna=True)
.agg(prop_weather_delay=("weather_delay_prop", "mean"))
.reset_index()
.sort_values("prop_weather_delay", ascending=False)
)
from lets_plot import *
(ggplot(weather_by_airport) +
geom_bar(aes(x="airport_code",
y="prop_weather_delay"),
stat="identity",
fill="#1f77b4") +
ggtitle("Proportion of Flights Delayed by Weather (Severe + Mild)") +
xlab("Airport") + ylab("Proportion delayed") +
theme(axis_text_x=element_text(angle=45, hjust=1)) +
ggsize(700, 400))Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.
type your results and analysis here
# Include and execute your code here